-
Notifications
You must be signed in to change notification settings - Fork 13.5k
[BOLT] Improve handling of relocations targeting specific instructions #66395
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'm not familiar yet with RISC asm, could you please describe why loading value from symbol results in text relocation (e.g. reference to the instruction)? Just interesting :) |
On RISC-V, there are certain relocations that target a specific instruction instead of a more abstract location like a function or basic block. Take the following example that loads a value from symbol `foo`: ``` nop 1: auipc t0, %pcrel_hi(foo) ld t0, %pcrel_lo(1b)(t0) ``` This results in two relocation: - auipc: `R_RISCV_PCREL_HI20` referencing `foo`; - ld: `R_RISCV_PCREL_LO12_I` referencing to local label `1` which points to the auipc instruction. It is of utmost importance that the `R_RISCV_PCREL_LO12_I` keeps referring to the auipc instruction; if not, the program will fail to assemble. However, BOLT currently does not guarantee this. BOLT currently assumes that all local symbols are jump targets and always starts a new basic block at symbol locations. The example above results in a CFG the looks like this: ``` .BB0: nop .BB1: auipc t0, %pcrel_hi(foo) ld t0, %pcrel_lo(.BB1)(t0) ``` While this currently works (i.e., the `R_RISCV_PCREL_LO12_I` relocation points to the correct instruction), it has two downsides: - Too many basic blocks are created (the example above is logically only one yet two are created); - If instructions are inserted in `.BB1` (e.g., by instrumentation), things will break since the label will not point to the auipc anymore. This patch proposes to fix this issue by teaching BOLT to track labels that should always point to a specific instruction. This is implemented as follows: - Add a new annotation type (`kLabel`) that allows us to annotate instructions with an `MCSymbol *`; - Whenever we encounter a relocation type that is used to refer to a specific instruction (`Relocation::isInstructionReference`), we register a new type of label (`InstructionLabels`) with the corresponding `BinaryFunction`; - During disassembly, add these instruction labels to the correct instructions; - During emission, emit these labels right before the instruction. I believe the use of annotations works quite well for this use case as it allows us to reliably track instruction labels. If we were to store them as offsets in basic blocks, it would be error prone to keep them updated whenever instructions are inserted or removed. I have chosen to add labels as first-class annotations (as opposed to a generic one) because the documentation of `MCAnnotation` suggests that generic annotations are to be used for optional metadata that can be discarded without affecting correctness. As this is not the case for labels, a first-class annotation seemed more appropriate.
e714030
to
8233eec
Compare
To materialize a 32-bit pc-relative address on RISC-V, you need two instruction: one See the psABI for more info. |
Thanks! Am I understand it right that this is from linker standpoint? In the final binary ld/addi would have just an imm operand. |
That all sounds correct.
It's similar to
As far as I can tell, using something like |
Thanks for clarifications, Job! Now it seems to be completely clear to me :) |
I've noticed one downside/annoyance with this approach. When emitting loads/stores for instrumentation (e.g., in None of this is a blocker (I have a working instrumentation implementation using the approach above) but it feels slightly annoying so I'd be interested in thoughts on this. |
Is it possible to directly address needed symbol using -4 addend to symbol from the second instruction, since in this case you can be sure they would be emitted one by one? |
No, that isn't possible. For one, the linker needs a Another issue is that the offset isn't always -4. I'm actually generating code like this:
So the |
Ping |
Adding data structures to BinaryFunction class is a bit expensive for us because it increases our memory footprint -- an uninitialized std::map<> will cost 48 bytes, which means an additional 5MB for 100k BinaryFunction objects loaded at runtime. The x86 target primarily deals with very large binaries, so it will be good if this solution could be mindful of that and perhaps reduce the memory footprint for x86 that will likely not use this additional InstructionLabels data structure. Perhaps convert it to an std::optionalstd::map? Or better yet, maybe try to reuse the existing Relocations map that we have in BF and register your relocs there, then fetch that in ::disassemble() to try to fill up your new annotations? |
Alright, that makes sense. I got rid of
So basically, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! LGTM
On RISC-V, this function will have to insert `auipc`+`lw`/`sw` pairs where the `auipc` needs a label annotation (see llvm#66395). In order to do this correctly within the context of `ParallelUtilities`, we need an allocator id when setting the label. This patch adds this allocator id argument and wires it through to all call sites. Note that this function is also made non-const since `setLabel` is non-const.
On RISC-V, there are certain relocations that target a specific instruction instead of a more abstract location like a function or basic block. Take the following example that loads a value from symbol
foo
:This results in two relocation:
R_RISCV_PCREL_HI20
referencingfoo
;R_RISCV_PCREL_LO12_I
referencing to local label1
which points to the auipc instruction.It is of utmost importance that the
R_RISCV_PCREL_LO12_I
keeps referring to the auipc instruction; if not, the program will fail to assemble. However, BOLT currently does not guarantee this.BOLT currently assumes that all local symbols are jump targets and always starts a new basic block at symbol locations. The example above results in a CFG the looks like this:
While this currently works (i.e., the
R_RISCV_PCREL_LO12_I
relocation points to the correct instruction), it has two downsides:.BB1
(e.g., by instrumentation), things will break since the label will not point to the auipc anymore.This patch proposes to fix this issue by teaching BOLT to track labels that should always point to a specific instruction. This is implemented as follows:
kLabel
) that allows us to annotate instructions with anMCSymbol *
;Relocation::isInstructionReference
), we register a new type of label (InstructionLabels
) with the correspondingBinaryFunction
;I believe the use of annotations works quite well for this use case as it allows us to reliably track instruction labels. If we were to store them as offsets in basic blocks, it would be error prone to keep them updated whenever instructions are inserted or removed.
I have chosen to add labels as first-class annotations (as opposed to a generic one) because the documentation of
MCAnnotation
suggests that generic annotations are to be used for optional metadata that can be discarded without affecting correctness. As this is not the case for labels, a first-class annotation seemed more appropriate.